Perception of speaker sincerity in complex social interactions by cochlear implant users

Understanding insincere language (sarcasm and teasing) is a fundamental part of communication and crucial for maintaining social relationships. This can be a challenging task for cochlear implant (CI) users, who receive degraded suprasegmental information important for perceiving a speaker’s attitude. We measured the perception of speaker sincerity (literal positive, literal negative, sarcasm, and teasing) in 16 adults with CIs using an established video inventory. Participants were presented with audio-only and audio-visual social interactions between two people with and without supporting verbal context. They were instructed to describe the content of the conversation and answer whether the speakers meant what they said. Results showed that subjects could not always identify speaker sincerity, even when the content of the conversation was perfectly understood. This deficit was greater for perceiving insincere relative to sincere utterances. Performance improved when additional visual cues or verbal context cues were provided. Subjects who were better at perceiving the content of the interactions in the audio-only condition benefited more from having additional visual cues for judging the speaker’s sincerity, suggesting that the two modalities compete for cognitive resources. Perception of content also did not correlate with perception of speaker sincerity, suggesting that what was said vs. how it was said were perceived using unrelated segmental versus suprasegmental cues. Our results further showed that subjects who had access to lower-order resolved harmonic information provided by hearing aids in the contralateral ear identified speaker sincerity better than those who used implants alone. These results suggest that measuring speech recognition alone in CI users does not fully describe the outcome. Our findings stress the importance of measuring social communication functions in people with CIs.


Introduction
Everyday social interactions are crucial for emotional well-being and maintaining relationships. To convey communicative messages, speakers use an array of verbal and nonverbal cues, such as discourse context, prosody, and facial expressions, to support language comprehension. However, sometimes these cues are inconsistent with each other, for example, when a they said?). Below we state our hypotheses on how these manipulated variables might affect perception. First, we hypothesized that for CI users, the deficit of perception of speaker intentions would be particularly salient for the perception of insincere (sarcasm, teasing) relative to sincere intentions. Besides the "truth bias", where we often assume that speakers are telling the truth [30], the insincere intentions may be particularly difficult for CI users. They must decode the possibly more complex and variable prosody in the insincere utterances to determine if they are incongruent with content [6,7]. The difficulties with insincere intentions may be reduced by the availability of visual or context cues. The effect of visual information on auditory perception has been well established [e.g., 31,32]. Most and Aviner [33] demonstrated that when auditory and visual information are congruent, performance in the audio-visual condition was better than audio-only but quite comparable to visual-alone condition among adolescent CI users. This suggests that visual cues were dominant in these tasks when the auditory cues are less salient [33]. Fengler and colleagues [34] provided further evidence of visual dominance, reporting greater interference of incongruent facial expressions in congenitally deafened CI users than the control group. For our tasks, the subjects might use different visual cues, i.e., lip reading or observing facial expressions, for perceiving content and speaker sincerity. Still, we hypothesized that performance would be better with additional visual information for both tasks. 
Contextual cues are especially important for comprehending ironic statements [35] and may reduce subjects' response time to insincere utterances [36]. We hypothesized that discourse context would greatly facilitate performance, particularly for insincere intentions and when stimuli are presented without visual cues.
Lastly, the measured behavioral results were correlated with key demographic variables (e.g., duration of hearing loss) and device features. Since duration of deafness has been consistently identified to predict CI outcomes [37,38], we hypothesized that perception in the audio-only condition would depend on subjects' duration of hearing deprivation. We also hypothesized that CI users who have residual acoustic hearing in the contralateral ear and take advantage of pitch information provided by a hearing aid would perform better than those without access to acoustic hearing.

Subjects and hardware
Sixteen cochlear implant users participated in the study. Six subjects were sequentially bilaterally implanted and had no residual hearing in either implanted ear. Five subjects were bimodal users, wearing a cochlear implant on one side and a hearing aid on the contralateral side. All participants were adult, post-lingually deafened, native English-speaking users of either Cochlear© (Cochlear Corporation, Englewood, CO) or Advanced Bionics (Advanced Bionics, Valencia, CA) devices. TH16 had early-onset perilingual hearing loss and became profoundly deaf when he was an adult. Subjects' mean age at the time of testing was 68.50 years, the mean duration of hearing loss was 33.29 years, and the mean CI experience was 8.49 years. Demographic information for participants and test ears is shown in Table 1. Duration of hearing loss was defined in the study as the time between the onset of hearing loss and implantation. All subjects provided written informed consent before taking part in the study. This study was approved by the East Carolina University Institutional Review Board.

Stimuli
The stimuli were chosen from a validated video inventory for testing social language perception [9]. The inventory has been previously used in studies with young adults [39], typically developing children [40], and older adults [41]. It consists of short video recordings depicting social exchanges intended to be sincere (positive and negative), and insincere (sarcasm and teasing). The videos in the complete published RISC inventory had been validated previously with 31 adult participants (mean age = 23.21 years, SD = 3.88). For the specific subset of 96 videos used in the current study, young typical-hearing adults in the study by Rothermich and Pell [9] identified the speaker's intention with an average accuracy of 84.26% correct (literal positive: M = 87%, SD = 17%, literal negative: M = 96%, SD = 3%, sarcastic: M = 76%, SD = 24%, teasing: M = 89%, SD = 4%). The different attitudes in the videos are expressed by using prosodic cues, facial expressions, and body language.
We have described the acoustic properties of the stimuli in more detail in a previous publication [see Table 2 in 40]. The stimuli used for the present study were selected from these videos based on the criterion that they should be comprehensible in the audio-only version. The selected stimuli consisted of videos recorded from 24 scenes × 4 intentions (96 trials in total) depicting a couple (a female person and a male person) having a conversation. In each scene, the final statement is produced with different intonations and visual expressions to convey four different intentions (literal positive, literal negative, sarcasm, and teasing). Of the 24 scenes, verbal context was provided in 15 scenes. Example scenes are shown in Table 2. Each of the 96 videos was transcribed, and the transcriptions were used to compare to the subjects' responses (see details under Procedures). The audio tracks of the 96 videos were extracted and processed in MATLAB to have equal RMS (root mean square) values. The stimuli were presented in audio-only and audio-visual conditions, for a total of 192 stimuli: 4 intentions × 24 scenes × 2 presentation modes.
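The RMS equalization step can be illustrated with a minimal sketch. The original processing was done in MATLAB and its exact implementation is not reported; the Python code below is only an illustration, and the function names (rms, equalize_rms), the toy waveforms, and the target level are assumptions introduced here.

```python
import math

def rms(samples):
    """Root mean square of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def equalize_rms(samples, target_rms):
    """Scale samples so that their RMS equals target_rms.

    A silent signal (RMS of zero) is returned unchanged, because no
    finite gain can bring it to the target level.
    """
    current = rms(samples)
    if current == 0:
        return list(samples)
    gain = target_rms / current
    return [s * gain for s in samples]

# Hypothetical toy waveforms standing in for two extracted audio tracks.
quiet = [0.01, -0.02, 0.015, -0.01]
loud = [0.4, -0.5, 0.45, -0.3]

# Assumed target level; the source does not report the value that was used.
TARGET_RMS = 0.1

quiet_eq = equalize_rms(quiet, TARGET_RMS)
loud_eq = equalize_rms(loud, TARGET_RMS)

# Size of the full stimulus set: 4 intentions x 24 scenes x 2 presentation modes.
n_stimuli = 4 * 24 * 2
```

After scaling, both toy signals have the same RMS, so level differences between recordings cannot serve as a cue to speaker intention.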

Procedures
Subjects were seated in a sound-attenuated booth. The audio signals were delivered from a loudspeaker placed 1 meter from the subject's head at 0° azimuth at 65 dB(A). The videos were displayed on a widescreen monitor placed right below the loudspeaker. The bilaterally implanted subjects used both of their processors during the experiment. The bimodal subjects used both their implant and hearing aid (on the side contralateral to the implant) during the experiment. The rationale for allowing the subjects to use both ears, rather than testing just the implanted ear or testing a single implanted ear, was to mimic real-life situations, where the subjects would use all devices available for such social interactions. MATLAB was used to create a user interface for delivering the test and collecting subjects' responses. The stimuli were presented in two blocks (audio-only and audio-visual). The audio-only condition was always tested first, followed by the audio-visual condition. The audio-visual condition was presented second because this condition was expected to be easier for the subjects, especially those who lip-read. Thus, the visual context was provided in a second block to avoid over-familiarization with the stimuli. Within each block, the stimuli were fully randomized. A different randomization was used if the first audio-visual stimulus was the same as the last audio-only stimulus. Each stimulus was presented to the subject as many times as needed. We acknowledge that this does not mimic a real-life situation, but stimuli were presented multiple times to avoid a floor effect. When the subjects were ready, they first described the conversation content in their own words to the best of their ability, i.e., what did they say? The verbal description was typed either by the subject or by the experimenter, and the response was saved for offline analysis.
After describing the conversation's content, the subject was then asked about the speaker's intention, i.e., did they mean what they said? The answer options were "Yes" and "No"; "Yes" would be a correct response if the intentions were sincere (literal positive or negative). "No" would be a correct response if the intentions were teasing or sarcasm. Subjects were encouraged to take frequent breaks during the experiment. The average testing time was 6 hours, depending on how often the stimuli were repeated. Offline, the subject's description of the conversation content was compared to the transcript of each stimulus. Three raters independently rated the subject's description as correct or incorrect. The response was entered into the analysis of speakers' intention only if at least two raters agreed that the subject understood the conversation content. This was based on the assumption that it would be impossible for subjects to perceive, even with visual cues, the intention underlying such complex social exchanges if they did not understand the content in the first place.
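The scoring rules described above can be summarized in a short sketch. This is an illustration in Python, not the code used in the study; the function names and the representation of rater judgments as booleans are assumptions.

```python
SINCERE = {"literal positive", "literal negative"}
INSINCERE = {"sarcasm", "teasing"}

def expected_answer(intention):
    """Correct answer to 'did they mean what they said?' for an intention."""
    if intention in SINCERE:
        return "Yes"
    if intention in INSINCERE:
        return "No"
    raise ValueError(f"unknown intention: {intention!r}")

def content_understood(ratings):
    """True if at least two of the three raters judged the description correct."""
    return sum(bool(r) for r in ratings) >= 2

def score_trial(intention, answer, ratings):
    """Score one trial.

    Returns None when the trial is excluded from the sincerity analysis
    (content not understood), otherwise True or False for the sincerity
    judgment.
    """
    if not content_understood(ratings):
        return None
    return answer == expected_answer(intention)

# Hypothetical trials: (intention, subject's answer, three rater judgments).
print(score_trial("sarcasm", "No", [True, True, False]))
print(score_trial("teasing", "Yes", [True, True, True]))
print(score_trial("literal positive", "Yes", [True, False, False]))
```

Returning None rather than False for excluded trials keeps "content not understood" separate from "sincerity judged incorrectly", which matters when sincerity is later scored only over trials with understood content.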

Results
The dark-colored bars (dark red and blue) in Fig 1 show   Perception of speaker sincerity was then quantified as the number of correctly perceived sincerities relative to the number of correctly perceived contents (Fig 2). The percentages inform performance at extracting speaker sincerity given the content was perfectly understood. Pearson's correlations suggest that perception of speaker sincerity was not correlated with perception of content under either audio-only [r = -0.16, p = 0.61] or audio-visual conditions [r = 0.021, p = 0.96].
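The conditional sincerity score can be made concrete with a small sketch. It is an illustration in Python under stated assumptions: the function name and the trial representation (pairs of booleans) are hypothetical, and the toy data are not study data.

```python
def sincerity_given_content(trials):
    """Percent of content-correct trials on which sincerity was also correct.

    trials is a list of (content_correct, sincerity_correct) booleans.
    Returns None if no trial had correctly perceived content, since the
    score is undefined in that case.
    """
    understood = [sincerity for content, sincerity in trials if content]
    if not understood:
        return None
    return 100.0 * sum(understood) / len(understood)

# Hypothetical subject: content was understood on 3 of 5 trials, and
# sincerity was correct on 2 of those 3.
toy_trials = [
    (True, True),
    (True, False),
    (False, True),   # excluded: content not understood
    (True, True),
    (False, False),  # excluded: content not understood
]
score = sincerity_given_content(toy_trials)
print(round(score, 2))
```

Because the denominator is the number of content-correct trials rather than all trials, a subject with poor content perception is not automatically penalized on the sincerity measure.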
Main effects of speaker intention categories (literal positive, literal negative, teasing, and sarcasm), presentation mode (audio-only vs. audio-visual), and availability of context on The benefit of visual cues for the perception of speaker intentions collapsed across conditions was correlated with subjects' performance in the perception of content in the audio-only condition (r = 0.76, p = 0.003) (Fig 3, left panel). The benefit of visual cues was quantified by calculating the difference in performance between the audio-visual and audio-only conditions. The additional visual information was more likely to help subjects with perceiving speaker sincerity if they did not struggle with understanding the audio-only stimuli. Further, no relationship was found between the benefit of visual cues for understanding the content and that for understanding speaker's sincerity [r = 0.17, p = 0.58].
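The visual-benefit measure and its correlation with audio-only content scores can be sketched as follows. This Python example is illustrative only: the function names are assumptions, the Pearson coefficient is computed from its textbook definition rather than with the software used in the study, and all scores are hypothetical.

```python
import math

def visual_benefit(audio_visual, audio_only):
    """Per-subject benefit of visual cues: audio-visual minus audio-only score."""
    return [av - ao for av, ao in zip(audio_visual, audio_only)]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical percent-correct scores for four subjects (not study data).
content_audio_only = [20, 40, 60, 80]
sincerity_audio_only = [50, 55, 60, 60]
sincerity_audio_visual = [52, 62, 72, 78]

benefit = visual_benefit(sincerity_audio_visual, sincerity_audio_only)
r = pearson_r(content_audio_only, benefit)
print(benefit, round(r, 3))
```

In this toy data set, subjects with higher audio-only content scores gain more from the visual cues, which mirrors the direction of the relationship reported above.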
Next, the effect of device features on the perception of content and speaker sincerity was analyzed. Subjects differed in their device types, i.e., unilaterally implanted, bilaterally implanted, and bimodal users. A one-way repeated-measures ANOVA showed no significant effect of device type on perception of content in the audio-only condition [F(2,15) = 0.25, p = 0.79]. However, a significant effect of device type was found on the perception of speakers' sincerity in the audio-only condition [F(2,12) = 6.42, p = 0.02]. Post-hoc analysis further showed that the bimodal subjects outperformed both the unilaterally and bilaterally implanted subjects (one-tailed, all p < 0.05). When only the data without context were examined, the device effect disappeared [F(2,12) = 0.92, p = 0.43], partially because this challenging listening condition resulted in a much smaller sample size due to missing data.
The relationship between performance and the subjects' demographic variables was weak. Longer duration of hearing loss was associated with worse perception of content in the audio-only condition [r = -0.60, p = 0.02] (Fig 3, right panel). No other demographic variables, such as age and duration of CI use, were predictive of perception of content or speaker sincerity (all p > 0.05).

Discussion
The current study examined adult CI users' perception of speaker sincerity in complex social interactions. Our results indicated a deficit, in that even when the CI users understood the content of the conversation perfectly, they were not always able to extract the underlying speaker sincerity. The deficit was more salient for identifying insincere versus sincere intentions or when visual cues or verbal context cues were absent. Visual information was more likely to help those who did not struggle with the content of the conversations, which could suggest a competition for cognitive resources. A shorter duration of hearing loss helped subjects understand the content, while access to pitch information via acoustic hearing benefited their perception of speaker sincerity. Below we provide a more detailed discussion of our findings.

Relationship between perception of content and speaker sincerity
It is clear from Fig 1 that subjects who understood the content of the interactions did not always correctly perceive the speaker's sincerity, confirming a deficit in CI users' ability to understand social aspects of language. More importantly, perception of content and perception of speaker sincerity were not correlated. These data suggest that the acoustic cues required for understanding segmental speech are different from the suprasegmental cues that signal a speaker's sincerity. Further evidence came from the lack of correlation between perception of speaker sincerity with verbal (segmental) context and perception of speaker sincerity without context. For the latter, perception would depend solely on speech prosody. Thus, the subjects' access to and use of the two sets of cues may not always be linked. These results suggest that measuring segmental speech perception in CI users may not fully describe the implant's efficacy in supporting social communication for its users.

The effect of visual information
Our results revealed that the addition of visual information facilitated comprehension of the content with a large effect size. The large effect size could partially result from an order effect because the audio-visual condition was always presented after the audio-only condition. For perceiving speaker sincerity, the benefit of visual cues was consistent with previous reports [34], but the effect was smaller for stimuli with verbal context than without. The results could suggest that the CI subjects put a greater weight on the verbal context cues than the visual cues if the verbal context was available. Visual information may be redundant if a given context is supportive enough for a literal or nonliteral interpretation of an utterance. However, examining the data in Fig 1 (upper panel), the smaller visual effect could also be due to a ceiling effect, where the performance with verbal context in the audio-only condition was already rather good. Future studies could introduce incongruent visual and context cues to determine the exact weighting between these cues.
The mechanisms underlying the visual effect for the perception of content and of speaker sincerity might be different. Looking at the speaker's face may have helped the subjects understand the content via lip-reading. In contrast, for the perception of speaker sincerity, subjects might observe the speakers' facial expressions and subtle body gestures. The fact that there was no correlation between the effect of visual cues for understanding content and speaker sincerity supports the idea that the listeners were using different visual cues for the two tasks. Further, our results showed that additional visual information tended to help those who did not struggle with understanding the content of the conversation in the audio-only condition (Fig 3, left panel). In the present tasks, the subjects must first understand the content of the conversation to be able to answer the question of whether the speaker meant what they said. We speculate that a lower listening effort for understanding the content might free up cognitive resources with which the visual cues can be processed and ultimately used to identify sincerity. A similar suggestion has been put forward by Chatterjee and colleagues [42], namely that obligatory speech perception might take away cognitive resources from more complex tasks such as emotion recognition.

Effect of speaker intention category
Our results also indicated that the subjects identified sincere intentions better than teasing and sarcasm. The effect was more prominent when the stimulus did not provide a verbal context. One reason for this finding could be a so-called "truth bias" [30]. It posits that by default, we assume that conversation partners tell the truth, i.e., that they mean what they said. Therefore, it is possible that subjects were biased to believe that the speaker is being sincere since that represents the unmarked intention. The incongruent/ambiguous nature of sarcasm and teasing statements could present a challenge for listeners generally. The difficulties could also be due to acoustic differences between the stimuli in that prosody in the insincere utterances was more complex and variable. Nonetheless, unlike basic emotions such as "happy" and "sad", it may be difficult to quantify these acoustic differences due to the lack of a stereotyped "ironic voice". Of all intentions, teasing was the least correctly identified. This was in line with previous results in that teasing is harder to infer during social communication compared with sarcasm [39,43]. We attribute this to several factors. It could be the frequency in daily life: sarcasm occurs more frequently than teasing and is recognized faster and with higher accuracy [44-46]. This dichotomy between sarcasm and teasing perception is often referred to as an "asymmetry of affect" [45] between these two types of irony. It indicates that while sarcasm still alludes to social norms of politeness by using positive language, teasing is riskier since it does not adhere to these norms on the surface level [47]; thus, it is harder to recognize.

Demographic variables
The participants in the present study used different device configurations, i.e., some were bilaterally implanted; some were unilaterally implanted and completely deaf in the contralateral ear; the rest of the participants were unilaterally implanted, had residual acoustic hearing in the contralateral ear, and used a hearing aid. All participants used their hearing devices during the tasks, as they would in a real-life listening situation. The most interesting yet somewhat anticipated results were that the three groups performed the same regardless of their device type for the perception of content. However, the bimodal users outperformed the other two groups and benefited from using their hearing aids to perceive speaker sincerity. Note that these effects were measured in the audio-only conditions. We could not confirm the device's effect on performance for just the stimuli without context. Four subjects could not perform the task (0% on content), greatly reducing the sample size for comparing three groups. Generally, the data provided evidence that the acoustic cues for perceiving segmental (content) versus suprasegmental (sincerity) speech do not overlap considerably and a hearing aid provides better access to the suprasegmental information. There is substantial evidence to indicate that amplified acoustic information, combined with electrical stimulation, consistently improves CI users' pitch perception. Even if acoustic information is often spectrally smeared and mismatched with what is provided by the implant, this benefit has been consistently demonstrated in lexical tone perception [48,49], music perception [50], and perception of competing speech [51,52].
Subjects' duration of hearing deprivation before implantation was associated with their performance in the content task in the audio-only condition, but was not predictive of their perception of sincerity or of their perception in the audio-visual conditions. These findings suggest that hearing deprivation before implantation may not be a strong factor driving individual variance in speech prosody perception but plays an important role in the perception of segmental information. Lastly, our subjects were older adults. It is possible that declining cognitive functions contributed to performance. Further studies are warranted to investigate how aging, cognitive function, and hearing devices interact in prosody processing with a neural prosthesis.

Conclusion
Our data suggest that CI users' ability to perceive speaker sincerity is impaired and that this impairment is not related to their performance in understanding segmental speech or the content of the conversations. Such a deficit may be alleviated if speakers provide a verbal context for their ironic statements or provide nonverbal body language cues. The deficit may also be alleviated if listeners use a bimodal system in which the hearing aid provides resolved pitch information. Our data suggest that evaluating CI outcomes using only speech perception measures does not fully describe CI users' ability for social communication. The outcomes will inform new directions in rehabilitation schemes that enable CI listeners to capitalize on multimodal cues and the combination of acoustic and electric stimulation to optimize social communication with the device.